This material is from the following webpage

http://neuralnetworksanddeeplearning.com/chap3.html

In this series of notebooks, there are techniques which can be used to improve the implementation of backpropagation and so improve the way neural networks learn.

These techniques include

  1. A better choice of cost function: cross-entropy
  2. Regularization methods: L1 and L2 regularization, dropout, and artificial expansion of the training data
  3. A better method for initializing the weights in the network
  4. A set of heuristics to help choose good hyper-parameters
In [1]:
%matplotlib inline
import sys
import numpy as np
from matplotlib import pyplot as plt

The cross-entropy cost function

Ideally, we hope and expect that our neural networks will learn fast from their errors. Does this happen in practice? To answer this question, let's look at a toy example. The example involves a neuron with just one input:

One neuron

We'll train this neuron to do something ridiculously easy: take the input 1 to the output 0. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias.

In [2]:
class ASigmoidNeuron:

    def __init__(self, w, b, lr = 0.15):
        # initialization
        self.w = w # weight
        self.b = b # bias
        self.lr = lr # learning rate

        self.x = 1. # the single training input
        self.y = 0. # the corresponding desired output
        self.n = 1. # number of training inputs

    # sigmoid function
    def sigmoid(self, x):
        return 1. / (1. + np.exp(-x))
    
    # derivative of the sigmoid function, written in terms of the sigmoid output sig
    def dsigmoid(self, sig):
        return sig * (1. - sig)

    # cost function, evaluated at training input x with desired output y
    def cost(self, x, y, cost_func):
        a = self.sigmoid(self.w * x + self.b)
        if cost_func == 'mse':
            return (a - y) ** 2 / 2
        elif cost_func == 'cross_entropy':
            return -(y * np.log(a) + (1 - y) * np.log(1 - a))
        else:
            return None

    # train function: the gradients are estimated by finite differences (the analytic forms appear in the trailing comments)
    def train(self, n_epoch = 300, cost_func = 'mse', plot = True):
        if plot:
            fig, ax = plt.subplots()
            ax.set_xlabel('Epoch')
            ax.set_ylabel('Cost')
            ax.set_ylim([0, 1.0])
            ax.set_xlim([0, n_epoch])
            epochs = []
            costs = []
            ax.plot(epochs, costs)

        for i in range(n_epoch):
            
            output = self.sigmoid(self.w * self.x + self.b)
            if cost_func == 'mse':
                self.w -= self.lr * ((self.sigmoid((self.w + 10 ** -5) * self.x + self.b) - self.y) ** 2 - (self.sigmoid(self.w * self.x + self.b) - self.y) ** 2) / (2 * 10 ** -5) # (output - self.y) * self.dsigmoid(output) * self.x / self.n
                self.b -= self.lr * ((self.sigmoid(self.w * self.x + (self.b + 10 ** -5)) - self.y) ** 2 - (self.sigmoid(self.w * self.x + self.b) - self.y) ** 2) / (2 * 10 ** -5) # (output - self.y) * self.dsigmoid(output) / self.n
            elif cost_func == 'cross_entropy':
                a = self.sigmoid(self.w * self.x + self.b)
                awh = self.sigmoid((self.w + 10 ** -5) * self.x + self.b) # output with the weight nudged by 1e-5
                abh = self.sigmoid(self.w * self.x + (self.b + 10 ** -5)) # output with the bias nudged by 1e-5
                self.w -= self.lr * (-(self.y * np.log(awh) + (1 - self.y) * np.log((1 - awh))) + (self.y * np.log(a) + (1 - self.y) * np.log((1 - a)))) / 10 ** -5 # (output - self.y) * self.x / self.n
                self.b -= self.lr * (-(self.y * np.log(abh) + (1 - self.y) * np.log((1 - abh))) + (self.y * np.log(a) + (1 - self.y) * np.log((1 - a)))) / 10 ** -5 # (output - self.y) / self.n
            
            if plot:
                epochs.append(i + 1)
                costs.append(self.cost(self.x, self.y, cost_func))
                ax.set_xticks([i + 1])
                line = ax.lines[0]
                line.set_xdata(epochs)
                line.set_ydata(costs)
                fig.canvas.draw()
            sys.stdout.write('\rinput : {}, w : {}, b : {}, output : {}'.format(1., self.w, self.b, output))

To make things definite, I'll pick the initial weight to be 0.6 and the initial bias to be 0.9. These are generic choices used as a place to begin learning. The initial output from the neuron is 0.82. Run the following cell to see how the neuron learns an output much closer to 0.0. The program computes the gradient, uses the gradient to update the weight and bias, and displays the result. The learning rate is $\eta=0.15$, which turns out to be slow enough that we can follow what's happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, $C$.

Quadratic cost function (mse)

$\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2, \tag{1}\end{eqnarray}$

where $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, the sum is over all training inputs, $x$, and $a$ is the neuron's output when input $x$ is used (in our toy example, the training input is $x=1$ and the corresponding desired output is $y=0$).
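As an aside, Equation (1) is easy to evaluate directly with numpy. The sketch below is mine, not part of the original notebook, and uses illustrative values for the inputs xs, targets ys, weight w, and bias b; it is separate from the training cell that follows.

# minimal sketch: the quadratic cost of Equation (1) for a single sigmoid neuron
# xs, ys, w, b are illustrative values, not taken from the notebook
xs = np.array([1.])   # training inputs
ys = np.array([0.])   # desired outputs
w, b = 0.6, 0.9       # weight and bias

a = 1. / (1. + np.exp(-(w * xs + b)))                   # neuron outputs a = sigmoid(w x + b)
quadratic_cost = np.sum((ys - a) ** 2) / (2 * len(xs))
print(quadratic_cost)                                   # roughly 0.33 for this starting point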

In [3]:
ASigmoidNeuron(w = .6, b = .9).train()
input : 1.0, w : -1.286566708505344, b : -0.9780612519403515, output : 0.09430108668295109514

As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about 0.09. Suppose, however, that we instead choose both the starting weight and the starting bias to be 2.0. In this case the initial output is 0.98, which is very badly wrong. Let's look at how the neuron learns to output 0 in this case. Run the following cell again.

In [4]:
ASigmoidNeuron(w = 2., b = 2.).train()
input : 1.0, w : -0.6844833027209016, b : -0.6856472416783047, output : 0.204206282019477446492

Although this example uses the same learning rate ($\eta=0.15$), we can see that learning starts out much more slowly.

This behaviour is strange when contrasted with human learning. We often learn fastest when we're badly wrong about something. But we've just seen that our artificial neuron has a lot of difficulty learning when it's badly wrong - far more difficulty than when it's just a little wrong. What's more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?

To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, $\partial C/\partial w$ and $\partial C/\partial b$. So saying 'learning is slow' is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let's compute the partial derivatives. Recall that we're using the quadratic cost function, which, from Equation (1), is given by

$\begin{eqnarray} C=\frac{(y-a)^2}{2}, \tag{2}\end{eqnarray}$

where $a$ is the neuron's output when the training input $x=1$ is used, and $y=0$ is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that $a=\sigma(z)$, where $z=wx+b$. Using the chain rule to differentiate with respect to the weight and bias we get

$\begin{eqnarray} \frac{\partial C}{\partial w}=(a-y)\sigma'(z)x=a\sigma'(z) \tag{3}\end{eqnarray}$

$\begin{eqnarray} \frac{\partial C}{\partial b}=(a-y)\sigma'(z)=a\sigma'(z), \tag{4}\end{eqnarray}$

where I have substituted $x=1$ and $y=0$. To understand the behaviour of these expressions, let's look more closely at the $\sigma'(z)$ term on the right-hand side. Recall the shape of the $\sigma$ function:

In [5]:
def plot_sigmoid():
    sigmoid = lambda x: 1 / (1 + np.exp(-x))
    dsigmoid = lambda x: sigmoid(x) * (1 - sigmoid(x))

    x = np.linspace(-5, 5, 100)

    # the sigmoid function
    plt.subplot(211)
    plt.title('Sigmoid function and its derivative')
    plt.ylabel(r'$\sigma$')
    plt.xlabel('z')
    plt.grid()
    plt.plot(x, sigmoid(x))

    # its derivative
    plt.subplot(212)
    plt.ylabel(r"$\sigma'$")
    plt.xlabel('z')
    plt.grid()
    plt.plot(x, dsigmoid(x))
    plt.tight_layout()
plot_sigmoid()

We can see from this graph that when the neuron's output is close to 1, the curve gets very flat, and so $\sigma'(z)$ gets very small. Equations (3) and (4) then tell us that $\partial C/\partial w$ and $\partial C/\partial b$ get very small. This is the origin of the learning slowdown, and the slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we've been playing with.
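To make the slowdown concrete, here is a small check of Equations (3) and (4) at the two starting points used above. This sketch is mine, not part of the original notebook; it simply evaluates $a\sigma'(z)$ numerically.

# sketch: size of the quadratic-cost gradient a * sigmoid'(z) at the two starting points above
sigmoid = lambda z: 1. / (1. + np.exp(-z))

for w, b in [(0.6, 0.9), (2.0, 2.0)]:
    z = w * 1. + b                  # training input x = 1
    a = sigmoid(z)
    grad = a * a * (1. - a)         # a * sigmoid'(z), Equations (3) and (4)
    print('a = {:.2f}, dC/dw = dC/db = {:.3f}'.format(a, grad))
# a = 0.82 gives a gradient of about 0.122; a = 0.98 gives only about 0.017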

Introducing the cross-entropy cost function

How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let's move a little away from our super-simple toy model. We'll suppose instead that we're trying to train a neuron with several input variables, $x_1$, $x_2$,..., corresponding weights $w_1$, $w_2$,..., and a bias, $b$:

One neuron with several input variables

The output from the neuron is, of course, $a=\sigma(z)$, where $z=\sum_j{w_j x_j}+b$ is the weighted sum of the inputs. We define the cross-entropy cost function for this neuron by

$\begin{eqnarray} C=-\frac{1}{n}\sum_x{[y\ln a+(1-y)\ln (1-a)]}, \tag{5}\end{eqnarray}$

where $n$ is the total number of items of training data, the sum is over all training inputs, $x$, and $y$ is the corresponding desired output.
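Translated directly into numpy, Equation (5) might look like the sketch below. This is my own illustration, not the notebook's code; the inputs xs, targets ys, weights ws, and bias b are made-up values.

# sketch: the cross-entropy cost of Equation (5) for a neuron with several inputs
# xs, ys, ws, b are made-up illustrative values
xs = np.array([[0., 1.], [1., 0.], [1., 1.]])   # one row per training input x
ys = np.array([0., 1., 1.])                     # desired outputs y
ws = np.array([0.5, -0.3])                      # weights w_j
b = 0.1                                         # bias

z = xs.dot(ws) + b                              # z = sum_j w_j x_j + b
a = 1. / (1. + np.exp(-z))                      # a = sigmoid(z)
cross_entropy = -np.mean(ys * np.log(a) + (1 - ys) * np.log(1 - a))
print(cross_entropy)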

It's not obvious that expression (5) fixes the learning slowdown problem. In fact, frankly, it's not even obvious that it makes sense to call this a cost function! Before addressing the learning slowdown, let's see in what sense the cross-entropy can be interpreted as a cost function.

Two properties in particular make it reasonable to interpret the cross-entropy as a cost function. First, it's non-negative, that is, $C>0$. To see this, notice that: (a) all the individual terms in the sum in (5) are negative, since both logarithms are of numbers in the range 0 to 1; and (b) there is a minus sign out the front of the sum.

Second, if the neuron's actual output is close to the desired output for all training inputs, $x$, then the cross-entropy will be close to zero**. To see this, suppose for example that $y=0$ and $a\approx0$ for some input $x$. This is a case when the neuron is doing a good job on that input. We see that the first term in the expression (5) for the cost vanishes, since $y=0$, while the second term is just $-\ln (1-a)\approx0$. A similar analysis holds when $y=1$ and $a\approx1$. And so the contribution to the cost will be low provided the actual output is close to the desired output.
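Both properties are easy to check numerically. The following sketch (mine, not from the notebook) evaluates the per-input term of (5) for a desired output $y=0$ at a good and at a badly-wrong actual output:

# sketch: the per-input cross-entropy term -[y ln a + (1 - y) ln(1 - a)] for y = 0
ce_term = lambda y, a: -(y * np.log(a) + (1 - y) * np.log(1 - a))

print(ce_term(0., 0.01))   # output close to the desired 0: cost is about 0.01, near zero
print(ce_term(0., 0.98))   # output badly wrong: cost is about 3.9, large and positive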

Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, $y$, for all training inputs, $x$. These are both properties we'd intuitively expect for a cost function. Indeed, both properties are also satisfied by the quadratic cost. But the cross-entropy has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let's compute the partial derivative of the cross-entropy cost with respect to the weights. We substitute $a=\sigma(z)$ in (5), and apply the chain rule twice, obtaining:

$\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & -\frac{1}{n}\sum_x{\left(\frac{y}{\sigma(z)}-\frac{(1-y)}{1-\sigma(z)}\right)\frac{\partial\sigma}{\partial w_j}} \tag{6}\\ & = & -\frac{1}{n}\sum_x{\left(\frac{y}{\sigma(z)}-\frac{(1-y)}{1-\sigma(z)}\right)\sigma'(z)x_j}. \tag{7}\end{eqnarray}$

Putting everything over a common denominator and simplifying this becomes:

$\begin{eqnarray} \frac{\partial C}{\partial w_j}=\frac{1}{n}\sum_x{\frac{\sigma'(z)x_j}{\sigma(z)(1-\sigma(z))}(\sigma(z) - y)} \tag{8}\end{eqnarray}$

Using the definition of the sigmoid function, $\sigma(z)=1/(1+e^{-z})$, and a little algebra we can show that $\sigma'(z)=\sigma(z)(1-\sigma(z))$. We see that the $\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$ terms cancel in the equation just above, and it simplifies to become:

$\begin{eqnarray} \frac{\partial C}{\partial w_j}=\frac{1}{n}\sum_x{x_j(\sigma(z)-y)}. \tag{9}\end{eqnarray}$

This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by $\sigma(z)-y$, i.e., by the error in the output. The larger the error, the faster the neuron will learn. This is just what we'd intuitively expect. In particular, it avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation (3). When we use the cross-entropy, the $\sigma'(z)$ term gets canceled out, and we no longer need to worry about it being small. This cancellation is the special miracle ensured by the cross-entropy cost function. Actually, it's not really a miracle. As we'll see later, the cross-entropy was specially chosen to have just this property.

In a similar way, we can compute the partial derivative for the bias.

$\begin{eqnarray} \frac{\partial C}{\partial b}=\frac{1}{n}\sum_x{(\sigma(z)-y)}. \tag{10}\end{eqnarray}$

Again, this avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation (4).

** To prove this I will need to assume that the desired outputs $y$ are all either 0 or 1. This is usually the case when solving classification problems, for example, or when computing Boolean functions.
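Equations (3), (9), and (10) can also be compared numerically. The sketch below is my own, using the toy-example setup $x=1$, $y=0$, evaluated at the badly-wrong starting point $w=b=2.0$; the $\sigma'(z)$ factor suppresses the quadratic-cost gradient, while the cross-entropy gradient stays large.

# sketch: gradient with respect to w at x = 1, y = 0, starting from w = b = 2.0
sigmoid = lambda z: 1. / (1. + np.exp(-z))

w, b, x, y = 2.0, 2.0, 1.0, 0.0
z = w * x + b
a = sigmoid(z)

quad_grad = (a - y) * a * (1 - a) * x   # Equation (3): includes sigmoid'(z) = a(1 - a)
ce_grad = (a - y) * x                   # Equation (9): the sigmoid'(z) factor has cancelled
print('quadratic : {:.3f}, cross-entropy : {:.3f}'.format(quad_grad, ce_grad))
# prints roughly 0.017 for the quadratic cost and 0.982 for the cross-entropy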

Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight 0.6 and starting bias 0.9. Run the following cell to see what happens when we replace the quadratic cost by the cross-entropy:

In [6]:
ASigmoidNeuron(w = .6, b = .9).train(cost_func = 'cross_entropy')
input : 1.0, w : 0.005212249008779979, b : -3.738182746095478, output : 0.02344358675042428523

Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before, with the weight and bias both starting at 2.0:

In [7]:
ASigmoidNeuron(w = 2., b = 2.).train(cost_func = 'cross_entropy')
input : 1.0, w : 0.005463060432636974, b : -3.7104407156336388, output : 0.0240954088276425755

Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve is initially much steeper than the initial flat region on the corresponding curve for the quadratic cost. It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.

** How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (1). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variable $v_j$. In other words, our 'position' now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C/\partial w_k$ and $\partial C/\partial b_l$. Writing out the gradient descent update rule in terms of components, we have

$\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{11}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{12}\end{eqnarray}$
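As an illustration of Equations (11) and (12), the update is just an element-wise step along the negative gradient for every weight and bias in the network. The sketch below is mine, not the notebook's code, and assumes the weights, biases, and gradients are given as lists of numpy arrays with matching shapes.

# sketch: one gradient descent step, Equations (11) and (12)
# weights, biases, grad_w, grad_b are assumed to be lists of numpy arrays of matching shapes
def gradient_descent_step(weights, biases, grad_w, grad_b, eta):
    weights = [w - eta * gw for w, gw in zip(weights, grad_w)]   # w_k -> w_k - eta * dC/dw_k
    biases = [b - eta * gb for b, gb in zip(biases, grad_b)]     # b_l -> b_l - eta * dC/db_l
    return weights, biases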